JOURNAL OF RESEARCH of the National Bureau of Standards 
Vol. 83, No. 5, September-October 1978 



Hashing with Linear Probing and Frequency Ordering 



Gordon Lyon 

Institute for Computer Sciences and Technology, National Bureau of Standards, Washington, D.C. 20234 

(June 5, 1978) 

A simple linear probing and exchanging method of Burkhard locally rearranges hash tables to account for 
reference frequencies. Examples demonstrate how frequency-sensitive rearrangements that depend upon linear 
probing can significantly enhance searches. 

1. Linear Probing 

Linear probing of a scatter (or hash) table interprets each 
key or item (these terms are interchangeable here) as a probe 
index into the table [1]. Typically, a key is divided by the 
table size and the remainder is used for indexing. If the 
selected slot is empty, the item is not present. Should the 
slot contain some other key, each next higher location is 
checked until the item is found, an empty slot is discovered, 
or the whole table has been examined. (Indexes that exceed 
table sizes wrap around.) Table slots thus searched define a 
key's collision resolution sequence. 
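
The search just described can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the names probe_sequence, insert and search are hypothetical, keys are assumed to be nonnegative integers, None is assumed to mark an empty slot, and deletions are not handled.

```python
def probe_sequence(key, size):
    """Yield the slots examined for `key`: its home slot, then each next
    higher slot, wrapping around past the end of the table."""
    home = key % size                  # remainder of key / table size
    for step in range(size):
        yield (home + step) % size


def insert(table, key):
    """Store `key` in the first empty slot on its probe sequence."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise OverflowError("table is full")


def search(table, key):
    """Return (slot, probes); slot is None when the key is absent."""
    probes = 0
    for slot in probe_sequence(key, len(table)):
        probes += 1
        if table[slot] is None:        # empty slot: the key cannot be present
            return None, probes
        if table[slot] == key:
            return slot, probes
    return None, probes                # whole table examined


table = [None] * 7
for k in (10, 17, 6, 13):
    insert(table, k)
print(search(table, 13))   # (0, 2): home slot 6 is taken, so the probe wraps to slot 0
```

The number of probes returned by search equals the number used when the key was inserted, which is why insertion order determines later retrieval costs.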

Linear probing is generally not suitable for nearly filled 
tables, since as empty slots disappear, searches get very 
long. Nonetheless, linear probing can be improved by 
allowing for item frequency-of-reference. 

1.1. Burkhard's Heuristic 

Recently, Burkhard has suggested a heuristic method of 
reordering scatter tables that are accessed via linear probing 
[2]. Burkhard's scheme depends upon the item ensemble E 
in a table having nonuniform access frequencies. Each 
reference to an item initiates an exchange of the item with its 
immediate predecessor on the collision resolution sequence, 
provided this is possible. Intuitively, one can see that more 
frequently accessed items remain near their original probes 
into the table, whereas unpopular table members migrate to 
poorer locations. The cost of exchanges can be reduced, for 
example, by performing them only on every tenth table 
reference. 
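
The exchange rule can be sketched as follows. This is an illustrative reading, not Burkhard's published code: the class name ExchangeTable, the method reference and the parameter period are hypothetical, and an exchange is assumed to be "possible" whenever the item is found beyond its home slot. Under linear probing the displaced predecessor stays reachable, because every slot between its own home and its new position remains occupied.

```python
class ExchangeTable:
    """Linear-probing table that moves a referenced key one slot toward home."""

    def __init__(self, size, period=1):
        self.slots = [None] * size      # None marks an empty slot (assumption)
        self.period = period            # exchange on every `period`-th reference
        self.references = 0

    def _home(self, key):
        return key % len(self.slots)

    def insert(self, key):
        n = len(self.slots)
        for step in range(n):
            i = (self._home(key) + step) % n
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i
        raise OverflowError("table is full")

    def reference(self, key):
        """Return the probes used to find `key` (None if absent), then exchange."""
        n = len(self.slots)
        home = self._home(key)
        for step in range(n):
            i = (home + step) % n
            if self.slots[i] is None:
                return None
            if self.slots[i] == key:
                self.references += 1
                if step > 0 and self.references % self.period == 0:
                    j = (i - 1) % n     # immediate predecessor on the probe path
                    self.slots[i], self.slots[j] = self.slots[j], self.slots[i]
                return step + 1
        return None


t = ExchangeTable(7)
for k in (0, 7, 14):                    # all three keys share home slot 0
    t.insert(k)
print([t.reference(14) for _ in range(3)])   # [3, 2, 1]
```

Setting period=10 corresponds to exchanging only on every tenth successful reference, which trades slower adaptation for fewer writes.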



2. Theoretical Limitations 

A number of recent studies examine limitations that exist 
on orderings of open addressing hash tables [3, 4, 5]. A 
typical limitation is that, given uniform frequencies, minimum 
average retrievals (in probes-per-item) are A(0.8) = 1.49, 
A(0.9) = 1.61, A(0.95) = 1.69, and A(1.0) = 1.83 for 
table loadings of 0.8, 0.9, 0.95 and 1.0. (Entries in table 1, 
d = 0.0, show how ordinary linear probing compares.) Such 
values of A( ) are determined by optimal solutions to 
"distribution" or "assignment" problems common in operations 
research. However, accounting for skewed frequencies can 
be as significant as using "assignment problem" solutions, 
which are also expensive for tables of several hundred 
entries. The potential advantages of rearrangements by 
frequencies include simplicity, low insertion costs, and 
adaptability. Population frequencies can even change with 
Burkhard's scheme. Weighting by frequency and "assignment" 
solutions can be combined, but the net improvement 
is often not nearly as great as each component might suggest. 
Given a sufficiently skewed population, it is most important 
to attend to the frequency weights [5]. 

Table 1. The influence of d. 

  d           0.0     0.25    0.50    0.75    1.0 

  A(0.50)     1.50    1.44    1.39    1.33    1.27 
  A(0.90)     5.50    4.68    3.87    3.05    2.23 
  A(0.95)    10.50    8.57    6.63    4.70    2.77 
  A(0.99)    50.50   38.92   27.30   15.77    4.10 

2.1. Descending-Popularity Insertions 

Peterson has proven that equiprobable items are insensitive 
to insertion order when linear probing is used [6]. A 
variation in argument shows that items inserted in order of 
decreasing frequencies achieve minimum average retrieval 
costs for their ensemble E [1, 3]. Assume that a partially 
filled table is packed optimally. Consider a new item of 
frequency equal to or less than any other item already in the 
table. Let the new item be contending for a table slot that is 
filled. By the nature of linear probing, both the new item and 
the resident will probe the same filled slots in a search for an 
empty alternative slot. Although the increases in search 
probes are identical, the extra probes should be assigned to 
the less frequent of the two, which is the new item. 

2.2. Large Tables 

Continuous distributions and integrals provide good 
approximate results when tables are large. Let f(x) be a 
distribution of ensemble items such that 

$$\int_0^1 f(x)\,dx = 1 \quad\text{and}\quad f(x + \epsilon) \le f(x);$$

f(x) represents a sorted order of descending frequencies. The 
expected probes-to-insert-an-item is approximated well for 
linear probing by [1]: 

$$C'(a) = \frac{1}{2}\left(1 + \frac{1}{(1 - a)^2}\right).$$

"a" indicates the table loading, i.e., the ratio of |E| to the 
table size. Insertion of the full ensemble E into a table gives 
an optimal retrieval average of 

$$A(a) = \int_0^1 f(x)\,C'(ax)\,dx.$$

The marginal insertion cost C' is sensitive to the true 
occupied table fraction "ax" rather than x alone. Since the 
ensemble probability is unity, the retrieval expression can be 
simplified slightly to 

$$A(a) = \frac{1}{2} + \frac{1}{2}\int_0^1 \frac{f(x)}{(1 - ax)^2}\,dx. \tag{i}$$

Example 1. Imagine an ensemble with a sorted frequency 
distribution of f(x) = 1 + d - 2dx. Such a population 
corresponds to a tilting or skewing of the usual uniform 
distribution (d = 0). Maximum skewedness (d = 1) gives a 
sawtooth distribution. Applying (i) and integrating directly, 

$$A(a) \approx \frac{1}{2}\left[1 + \frac{1 - d}{a(1 - a)} - \frac{1 + d}{a} - \frac{2d}{a^2}\log(1 - a)\right].$$

The effect of "d" is pronounced and useful, as demonstrated 
in table 1. Improvements are linear in d. 
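
The descending-popularity argument and the effect of d can also be checked empirically. The sketch below is a small simulation under stated assumptions, not an experiment from the paper: the function names retrieval_cost and experiment are hypothetical, home addresses are drawn uniformly at random, the weights sample f(x) = 1 + d - 2dx at segment midpoints, and the table size, loading and seed are arbitrary choices. It compares the weighted mean probes per retrieval for three insertion orders over the same set of home addresses.

```python
import random


def retrieval_cost(homes, weights, order, size):
    """Weighted mean probes per retrieval after inserting items in `order`."""
    occupied = [False] * size
    total = 0.0
    for k in order:
        slot, probes = homes[k], 1
        while occupied[slot]:            # linear probing with wrap-around
            slot = (slot + 1) % size
            probes += 1
        occupied[slot] = True
        total += weights[k] * probes     # a retrieval repeats the insertion probes
    return total / sum(weights)


def experiment(size=2000, load=0.9, d=1.0, seed=1):
    """Compare insertion orders for the skewed ensemble f(x) = 1 + d - 2dx."""
    rng = random.Random(seed)
    n = int(size * load)
    # Item k stands for x = (k + 0.5) / n, so the weights are already descending.
    weights = [1 + d - 2 * d * (k + 0.5) / n for k in range(n)]
    homes = [rng.randrange(size) for _ in range(n)]
    descending = list(range(n))
    ascending = descending[::-1]
    shuffled = descending[:]
    rng.shuffle(shuffled)
    orders = {"descending": descending, "shuffled": shuffled, "ascending": ascending}
    return {name: retrieval_cost(homes, weights, order, size)
            for name, order in orders.items()}


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0):
        print(d, experiment(d=d))
```

With d = 0 the three orders should give the same cost, in line with Peterson's result; as d grows, the descending order should pull ahead. Individual runs fluctuate around the large-table estimates of table 1, especially at high loadings.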

3. A Practical Estimation Technique 

In many cases it is illuminating to apply (i) to estimate 
retrieval prospects for observed data. Empirical frequencies 
do not always fit common curves, e.g., data are discontinuous 
or derive from multimodal distributions. Nevertheless, (i) 
takes a very simple tabular form when a distribution f(x) 
comprises segments of straight lines. Such lines are easily 
drawn on a data frequency plot, and the necessary end points 
and slopes read directly once the plot area has been 
normalized to unity. Expanding (i) by parts and noting that 
f'(x) is constant for straight lines: 

Then breaking the interval [0, 1] into line segments {l}, each 
denoted [l-, l+], 
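
A sketch of how the expansion works out, under stated assumptions: (i) is taken to be A(a) = 1/2 + (1/2) times the integral of f(x)/(1 - ax)^2 over [0, 1], the loading satisfies 0 < a < 1, and m_l (a symbol introduced here for illustration) denotes the constant slope f'(x) on segment l. Integrating by parts on one segment, with 1/(a(1 - ax)) as the antiderivative of (1 - ax)^(-2), and then summing over the segments gives:

```latex
\int_{l_-}^{l_+} \frac{f(x)}{(1-ax)^2}\,dx
  = \left[\frac{f(x)}{a(1-ax)}\right]_{l_-}^{l_+}
  + \frac{m_l}{a^2}\,\ln\frac{1-a\,l_+}{1-a\,l_-},
\qquad
A(a) = \frac{1}{2} + \frac{1}{2}\sum_{l}
  \left\{ \frac{f(l_+)}{a(1-a\,l_+)} - \frac{f(l_-)}{a(1-a\,l_-)}
  + \frac{m_l}{a^2}\,\ln\frac{1-a\,l_+}{1-a\,l_-} \right\}.
```

Each segment thus contributes one row of a table: its end points, the values of f at those end points (taken from within the segment where f jumps), and its slope.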

Example 2 (Linear segments). The tabulation technique 
is easily applied to problems. Let f(x) = 3 when 0 ≤ x < 
0.25, f(x) = 0.50 when 0.25 ≤ x < 0.50, and f(x) = 1 - 
x otherwise. Note that f(x) integrates to unity on the interval 
[0, 1] and that f(x + ε) ≤ f(x). Applying the tabulation 
formula, retrieval values for the example are A(0.5) = 1.17, 
A(0.9) = 1.68, A(0.95) = 1.94, and A(0.99) = 2.66. 
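
The tabulation lends itself to a short program. The sketch below is one possible arrangement, not the paper's own table: the names segment_term, retrieval_average and EXAMPLE_2 are hypothetical, each segment is assumed to be given as (left end, right end, f at left end, f at right end), and the loading must satisfy 0 < a < 1. It assumes (i) has the form A(a) = 1/2 + (1/2) times the integral of f(x)/(1 - ax)^2 over [0, 1], evaluated segment by segment through integration by parts.

```python
import math


def segment_term(a, lo, hi, f_lo, f_hi):
    """Integral of f(x) / (1 - a*x)**2 over [lo, hi] for a straight-line f."""
    slope = (f_hi - f_lo) / (hi - lo)
    boundary = f_hi / (a * (1 - a * hi)) - f_lo / (a * (1 - a * lo))
    log_part = slope / (a * a) * math.log((1 - a * hi) / (1 - a * lo))
    return boundary + log_part


def retrieval_average(a, segments):
    """Estimate A(a) for a piecewise-linear frequency curve, 0 < a < 1."""
    return 0.5 + 0.5 * sum(segment_term(a, *seg) for seg in segments)


# Example 2: f = 3 on [0, 0.25), f = 0.5 on [0.25, 0.5), f = 1 - x on [0.5, 1].
EXAMPLE_2 = [
    (0.00, 0.25, 3.0, 3.0),
    (0.25, 0.50, 0.5, 0.5),
    (0.50, 1.00, 0.5, 0.0),
]

if __name__ == "__main__":
    for a in (0.5, 0.9, 0.95, 0.99):
        print(a, round(retrieval_average(a, EXAMPLE_2), 3))
```

The single straight line f(x) = 1 + d - 2dx fits the same scheme as one segment, (0.0, 1.0, 1 + d, 1 - d), so the function can also be used to recompute entries of table 1.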

4. Summary 

Theoretical examples demonstrate that table reorderings 
make excellent improvements on linear probing, especially 
in everyday applications that more often than not have 
distinctly unequal reference frequencies. The simplicity and 
low costs of linear (exchange) probing make it an attractive 
possibility for practical applications. 